AI Model Inference Optimization

Inference performance metrics

Latency metrics

= measure the time from when a user sends a query until they receive the complete response
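A minimal sketch of measuring this kind of end-to-end latency: time many requests and report percentiles, since tail latency (e.g., p90) often matters more than the average. The `fake_model_generate` function is a hypothetical stand-in for a real model call.

```python
import time
import statistics

def fake_model_generate(prompt):
    # Hypothetical placeholder for a real model call; sleeps to simulate work.
    time.sleep(0.01)
    return "response"

def measure_latency(generate, prompt, n_requests=20):
    """Time from sending a query until the complete response is received,
    sampled over n_requests calls."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p90": q[89]}  # median and tail latency, in seconds

stats = measure_latency(fake_model_generate, "hello")
print(stats)
```

Reporting percentiles rather than a single mean is standard practice because a few slow requests can dominate user experience without moving the average much.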

Utilization metrics

= measure how efficiently a resource (e.g., GPU compute or memory) is being used
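One common compute-utilization metric can be sketched as the ratio of achieved FLOP/s to the hardware's peak FLOP/s (often called Model FLOPs Utilization, MFU). The numbers below are illustrative assumptions, not measurements.

```python
def mfu(tokens_per_second, params_billion, peak_tflops):
    """Achieved FLOP/s divided by peak FLOP/s.

    Uses the rough rule of thumb that generating one token costs
    about 2 * N FLOPs for a model with N parameters.
    """
    achieved_tflops = 2 * params_billion * 1e9 * tokens_per_second / 1e12
    return achieved_tflops / peak_tflops

# Illustrative example: a 7B-parameter model decoding 100 tokens/s
# on hardware with an assumed 312 peak TFLOP/s.
print(f"MFU: {mfu(100, 7, 312):.1%}")
```

A low MFU at small batch sizes is typical for autoregressive decoding, which is memory-bandwidth-bound rather than compute-bound; batching requests is one way to raise it.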

Inference Optimization

Inference optimization can be done at different levels:

Inference model optimization

Inference service optimization